Statistics with Python

I wrote the following code snippets while reading the book "Think Stats" by Allen B. Downey which aim is given by its subtitle "Probability and Statistics for Programmers". Initially this notebook served for me as a playground to reproduce various topics of the book. Now it's a quick reminder of statistical concepts and syntax of tools for it.

(c) Florian Hoppe 2013


In [3]:
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
import matplotlib.pyplot as plt
import numpy.random as rand
from datetime import datetime, timedelta
import dateutil
from scipy.stats import pareto
import scipy.stats
import random as rnd

In [4]:
%pylab inline


Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.zmq.pylab.backend_inline].
For more information, type 'help(pylab)'.

In [5]:
a, m = 1.7, 150.
s = Series(np.random.pareto(a, 10000000) + m)
print("%f < x < %f" % (min(s),max(s)))


150.000000 < x < 12014.795739

In [6]:
s.describe()


Out[6]:
count    10000000.000000
mean          151.427570
std             9.566633
min           150.000000
25%           150.184620
50%           150.503956
75%           151.261504
max         12014.795739
dtype: float64

In [7]:
count, bins, ignored = plt.hist(s, 100, normed=True)
fit = a*m**a/bins**(a+1)
plt.xlim([0,1000])
plt.plot(bins, max(count)*fit/max(fit),linewidth=2, color='r')
plt.show()



In [8]:
rv = pareto([.5,] * scipy.stats.pareto.numargs)

Different Probability Functions:


In [51]:
s_w = Series(np.random.weibull(2.5,100000))
s_n = Series(np.random.randn(100000))

In [52]:
hist_w = np.histogram(s_w,bins=100)
hist_n = np.histogram(s_n,bins=100)

In [53]:
plot(hist_w[0],'r',hist_n[0],'b')


Out[53]:
[<matplotlib.lines.Line2D at 0x1130eb6d0>,
 <matplotlib.lines.Line2D at 0x1130eb950>]

In [56]:
# Pearson skewness is better for real world data as it's more robust against outlieres:
def pearson_skew(arr):
    return 3*(arr.mean()-arr.median())/arr.var()

print("mean: %f, median: %f, skew: %f, Pearson skewness: %f" % (s_w.mean(), s_w.median(), scipy.stats.skew(s_w), pearson_skew(s_w)))


mean: 0.886366, median: 0.863186, skew: 0.356908, Pearson skewness: 0.481884

In [6]:
plt.bar(hist_n[1][:-1],hist_n[0],width=.1)


Out[6]:
<Container object of 100 artists>

In [7]:
CDF_w = Series(hist_w[0]).cumsum()
CDF_n = Series(hist_n[0]).cumsum()
plot(CDF_w,'r',CDF_n,'b')


Out[7]:
[<matplotlib.lines.Line2D at 0x1083e8810>,
 <matplotlib.lines.Line2D at 0x1083e8dd0>]

In [172]:
CDF_w.sum()


Out[172]:
9219905

In [14]:
nrm = scipy.stats.norm(100, 15)

In [15]:
plot(nrm.pdf(np.arange(-10,10,.1)),'r', nrm.cdf(np.arange(-10,10,.1)),'b')


Out[15]:
[<matplotlib.lines.Line2D at 0x108a9f3d0>,
 <matplotlib.lines.Line2D at 0x108a9f650>]

In [195]:
6*10**9*nrm.pdf(190)


Out[195]:
2.4303531399293141

In [203]:
n = 4
std([max(random.normal(950,50,n)) for _ in range(365)])


Out[203]:
35.350982073471727

In [11]:
#plot(random.normal(950,50,365),'g',random.normal(1002,35,365),'r')
norm950 = scipy.stats.norm(950, 50)
norm1000 = scipy.stats.norm(1002, 35)
xs = np.arange(800,1200)
plot(xs,norm950.pdf(xs),'r', xs,norm1000.pdf(xs),'b')


Out[11]:
[<matplotlib.lines.Line2D at 0x108a6f490>,
 <matplotlib.lines.Line2D at 0x108a6f6d0>]

In [233]:
flipping_15_coins = scipy.stats.binom(15,.5)
flipping_15_coins.pmf(10) # propability that 10 flips yield head


Out[233]:
0.091644287109375097

In [17]:
# creating a 
import scipy.stats as stats
x = np.linspace(0, 1, 100)
plot(x,stats.beta.pdf(x,3,5),x,stats.beta.pdf(x,5,5),x,stats.beta.pdf(x,5,3),x,stats.beta.pdf(x,1,5),x,stats.beta.pdf(x,5,1))


Out[17]:
[<matplotlib.lines.Line2D at 0x108fd0890>,
 <matplotlib.lines.Line2D at 0x108fd0ad0>,
 <matplotlib.lines.Line2D at 0x108fd4090>,
 <matplotlib.lines.Line2D at 0x108fd4590>,
 <matplotlib.lines.Line2D at 0x108fd4a90>]

In [74]:
ser_twonormals = Series(random.normal(10,1,size=(10000)) + random.normal(50,1,size=(10000)))
ser_twonormals.hist(bins=100)
ser_twonormals.describe()


Out[74]:
count    10000.000000
mean        59.999976
std          1.428755
min         54.414580
25%         59.036657
50%         59.987147
75%         60.971587
max         66.337055
dtype: float64

In [75]:
ser_twonormals = Series(np.concatenate([random.normal(10,1,size=(10000)), random.normal(50,1,size=(10000))]))
ser_twonormals.hist(bins=100)
ser_twonormals.describe()


Out[75]:
count    20000.000000
mean        29.995583
std         20.030437
min          6.066369
25%          9.990733
50%         29.506300
75%         49.998532
max         53.511374
dtype: float64

Hypothesis Testing


In [91]:
cd /Users/florianhoppe/Google Drive/Wissen/Books/thinkstats/thinkstats_resources


/Users/florianhoppe/Google Drive/Wissen/Books/thinkstats/thinkstats_resources

In [112]:
import survey_using_panda
preg = Pregnancies()
preg.ReadRecords()
#len(preg.records)


Out[112]:
13593

In [166]:
preg_columns = ['caseid','nbrnaliv','babysex','birthwgt_lb','birthwgt_oz','prglength','outcome','birthord','agepreg','finalwgt']
data = []
for record in preg.records:
    data += [[getattr(record,field) for field in preg_columns]]

data = DataFrame(data,columns=preg_columns)
print(data[:3])


   caseid nbrnaliv babysex birthwgt_lb birthwgt_oz  prglength  outcome  \
0       1        1       1           8          13         39        1   
1       1        1       2           7          14         39        1   
2       2        3       1           9           2         39        1   

  birthord agepreg      finalwgt  
0        1   33.16   6448.271112  
1        2   39.25   6448.271112  
2        1   14.33  12999.542264  

In [322]:
#data.birthord[data.birthord=='NA'] = np.NAN
data.birthord.value_counts()


Out[322]:
NA    4445
1     4413
2     2874
3     1234
4      421
5      126
6       50
7       20
8        7
9        2
10       1
dtype: int64

In [168]:
reported_birth = data.ix[data.birthord!='NA',:]

In [176]:
first_born = data.ix[data.birthord==1,:]
others = data.ix[(data.birthord!=1) & (data.birthord!='NA'),:]

In [327]:
field_to_check = 'finalwgt'#'prglength'
diff_of_means = getattr(first_born,field_to_check).mean()-getattr(others,field_to_check).mean()
# resampling
n = 1000
fb_samples_prglength = getattr(first_born,field_to_check).ix[first_born.index[rnd.sample(xrange(len(first_born)),n)]]
o_samples_prglength = getattr(others,field_to_check).ix[others.index[rnd.sample(xrange(len(others)),n)]]
diff_of_samples = Series(np.array(fb_samples_prglength.astype(float)) - np.array(o_samples_prglength.astype(float)))
diff_of_samples.hist(bins=100)
diff_of_samples.describe()
print("In %.0f %% cases the difference of the field %s is larger then the mean difference %f." % (float(100.0*(diff_of_samples>diff_of_means).sum())/n,field_to_check,diff_of_means))


In 57 % cases the difference of the field finalwgt is larger then the mean difference -503.481008.

In [333]:
first_born[['prglength','finalwgt']].describe()


Out[333]:
prglength finalwgt
count 4413.000000 4413.000000
mean 38.600952 8143.762911
std 2.791901 8270.404870
min 0.000000 118.656790
25% 39.000000 4033.606004
50% 39.000000 6456.845037
75% 40.000000 9496.575715
max 48.000000 261879.953864

In [329]:
others[['prglength','finalwgt']].describe()


Out[329]:
prglength finalwgt
count 4735.000000 4735.000000
mean 38.522914 8647.243919
std 2.615852 9880.977442
min 4.000000 374.737005
25% 39.000000 3963.834554
50% 39.000000 6365.031570
75% 39.000000 9598.147669
max 50.000000 261879.953864

In [331]:
first_born[['prglength','finalwgt']].describe()-others[['prglength','finalwgt']].describe()


Out[331]:
prglength finalwgt
count -322.000000 -322.000000
mean 0.078037 -503.481008
std 0.176049 -1610.572572
min -4.000000 -256.080216
25% 0.000000 69.771450
50% 0.000000 91.813467
75% 1.000000 -101.571954
max -2.000000 0.000000

Poisson Distribution


In [31]:
def tp():
    ser = Series(np.random.poisson(10,1000))
    return ser[ser<=10].count()

mean([tp() for _ in range(100)])


Out[31]:
582.76999999999998

In [3]:
ser = Series(np.random.poisson(10,1000))

In [35]:
ser.hist()


Out[35]:
<matplotlib.axes.AxesSubplot at 0x107f40750>

In [37]:
# it's right-side skewed
scipy.stats.skew(ser)


Out[37]:
0.3148389791854863

In [70]:
p_ser = []
for val in ser.unique():
    p_of_val = ser[ser==val].count()/float(len(ser))
    p_ser += [p_of_val]
    
plt.bar(list(ser.unique()),p_ser)


Out[70]:
<Container object of 20 artists>

In [63]:
ser.hist(bins=len(ser.unique()))


Out[63]:
<matplotlib.axes.AxesSubplot at 0x109757650>

In [43]:
rv = scipy.stats.poisson(10)
rv.cdf(10)


Out[43]:
0.58303975019298493

In [92]:
lam = 10
n = 10000
ser = Series(np.random.poisson(lam,n))
ser.describe()


Out[92]:
count    10000.000000
mean        10.071900
std          3.188189
min          1.000000
25%          8.000000
50%         10.000000
75%         12.000000
max         24.000000
dtype: float64

In [51]:
p_below_mean = ser[ser<=ser.mean()].count().astype(float)/len(ser)
print("Percent of the samples are below the mean: %.0f %%" % (p_below_mean*100))
print("This equals (by definition) the value of the cummultative density function of the same poisson distribution: %.0f %%" % (100*scipy.stats.poisson(lam).cdf(lam)))
print("And the inverse of the mass function of that percentage is again the lambda: %.1f" % scipy.stats.poisson.ppf(p_below_mean,lam))


Percent of the samples are below the mean: 58 %
This equals (by definition) the value of the cummultative density function of the same poisson distribution: 58 %
And the inverse of the mass function of that percentage is again the lambda: 10.0

In [109]:
scipy.stats.poisson.ppf(scipy.stats.poisson(lam).cdf(lam),lam) == lam


Out[109]:
True

In [99]:
sorted_ser = np.sort(ser)
plot(arange(0,1,.01),scipy.stats.poisson.ppf(arange(0,1,.01),lam),'rx',arange(0,1,1.0/len(sorted_ser)),sorted_ser,'b')


Out[99]:
[<matplotlib.lines.Line2D at 0x1097c2c90>,
 <matplotlib.lines.Line2D at 0x1097c7210>]

In [100]:
scipy.stats.poisson.ppf(.2,lam)


Out[100]:
7.0

In [102]:
# after sorting the index (id sorted_ser[i] or sorted_ser.ix[i]) still references the old order:
sorted_ser.values[.2 * len(sorted_ser)]


Out[102]:
7

In [142]:
dist_percentage = 0.97
lowb, highb = scipy.stats.poisson.interval(dist_percentage,lam)
print("To get at least %.0f %% of a poisson distribution with lambda %i the given event must be between %i and %i." % (dist_percentage,lam,lowb,highb))
print("For this interval the sample series contains %.1f %% events" % (ser[(ser>=lowb) & (ser<=highb)].count()*100.0/len(ser)))


To get 1 % of a poisson distribution with lambda 10 the given event must be between 4 and 17.
For this interval the sample series contains 97.4 % events

Hypothesis Testing


In [70]:
# testing if the number of occurences matches expected ones:

alpha = 0.05
observed = np.array([126,174]) # absolut value of occurences of two events
expected = np.array([3.0/7, 4.0/7]) * observed.sum() # expected are here primarily given as percents/ration
X_squared, p_value = scipy.stats.chisquare(observed,expected)
print("The observed occurences have a propability of %.1f %%. The X^2 difference between observed and expected is %.2f." % (100*p_value, X_squared))
print(" => with significance niveau %.2f the null hypothesis (observed mean matches assumed one) is %s" % (alpha,"accepted" if p_value>alpha else "rejected"))


The observed occurences have a propability of 76.4 %. The X^2 difference between observed and expected is 0.09.
 => with significance niveau 0.05 the null hypothesis (observed mean matches assumed one) is accepted

In [69]:
# t test: testing how a observed mean relates to a assumed (== true) mean ie. if it's equal, greater or less
# TODO add assumptions on distributions and ways to performe less and greater tests

true_mean = 10
alpha = 0.05
ser = random.normal(true_mean,1,size=(100000))
err,p_value = scipy.stats.ttest_1samp(ser,true_mean)
print("Given the sample series it's %.1f %% likely that the mean of the underlying distribution is %.0f. Error is %.4f" % (p_value*100,true_mean,err))
print(" => with significance niveau %.2f the null hypothesis (observed mean matches assumed one) is %s" % (alpha,"accepted" if p_value>alpha else "rejected"))
test_mean = 9.9
err,p_value = scipy.stats.ttest_1samp(ser,test_mean)
print("Given the sample series it's %.1f %% likely that the mean of the underlying distribution is %.0f." % (p_value*100,test_mean))
print(" => with significance niveau %.2f the null hypothesis (observed mean matches assumed one) is %s" % (alpha,"accepted" if p_value>alpha else "rejected"))


Given the sample series it's 68.3 % likely that the mean of the underlying distribution is 10. Error is 0.4081
 => with significance niveau 0.05 the null hypothesis (observed mean matches assumed one) is accepted
Given the sample series it's 0.0 % likely that the mean of the underlying distribution is 10.
 => with significance niveau 0.05 the null hypothesis (observed mean matches assumed one) is rejected

In [ ]: